The Thera bank recently saw a steep decline in the number of users of their credit card, credit cards are a good source of income for banks because of different kinds of fees charged by the banks like annual fees, balance transfer fees, and cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.
Customers’ leaving credit card services would lead the bank to loss, so the bank wants to analyze the data of customers and identify the customers who will leave their credit card services and the reason for same – so that the bank could improve upon those areas.
You as a Data Scientist at Thera Bank need to explore the data provided, identify patterns, and come up with a classification model to identify customers likely to churn, and provide actionable insights and recommendations that will help the bank improve its services so that customers do not renounce their credit cards.
CLIENTNUM: Client number. Unique identifier for the customer holding the accountAttrition_Flag : Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"Customer_Age : Age in YearsGender : The gender of the account holderDependent_count : Number of dependentsEducation_Level : Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College(refers to a college student), Post-Graduate, Doctorate.Marital_Status : Marital Status of the account holderIncome_Category : Annual Income Category of the account holderCard_Category : Type of CardMonths_on_book : Period of relationship with the bankTotal_Relationship_Count : Total no. of products held by the customerMonths_Inactive_12_mon : No. of months inactive in the last 12 monthsContacts_Count_12_mon : No. of Contacts between the customer and bank in the last 12 months
Credit_Limit : Credit Limit on the Credit CardTotal_Revolving_Bal : The balance that carries over from one month to the next is the revolving balanceAvg_Open_To_Buy : Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)Total_Trans_Amt : Total Transaction Amount (Last 12 months)Total_Trans_Ct : Total Transaction Count (Last 12 months)Total_Ct_Chng_Q4_Q1 : Ratio of the total transaction count in 4th quarter and the total transaction count in the 1st quarterTotal_Amt_Chng_Q4_Q1 : Ratio of the total transaction amount in 4th quarter and the total transaction amount in the 1st quarterAvg_Utilization_Ratio : Represents how much of the available credit the customer spent#libraries to read and manipulate data
import numpy as np
import pandas as pd
#libraries to visualize data
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
data = pd.read_csv("/content/drive/MyDrive/AIML/Featurization_Model_Selection & Tunning/BankChurners.csv")
df = data.copy()
First and last 5 rows of the dataset
# looking at head (first 5 observations)
df.head()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | ... | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | ... | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | ... | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | ... | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | ... | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
5 rows × 21 columns
# looking at tail (last 5 observations)
df.tail()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10122 | 772366833 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | ... | 2 | 3 | 4003.0 | 1851 | 2152.0 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | 710638233 | Attrited Customer | 41 | M | 2 | NaN | Divorced | $40K - $60K | Blue | 25 | ... | 2 | 3 | 4277.0 | 2186 | 2091.0 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | 716506083 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | ... | 3 | 4 | 5409.0 | 0 | 5409.0 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10125 | 717406983 | Attrited Customer | 30 | M | 2 | Graduate | NaN | $40K - $60K | Blue | 36 | ... | 3 | 3 | 5281.0 | 0 | 5281.0 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
| 10126 | 714337233 | Attrited Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | ... | 2 | 4 | 10388.0 | 1961 | 8427.0 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
5 rows × 21 columns
#Checking the shape of the dataset
df.shape
(10127, 21)
#Checking the data types of the columns for the dataset
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CLIENTNUM 10127 non-null int64 1 Attrition_Flag 10127 non-null object 2 Customer_Age 10127 non-null int64 3 Gender 10127 non-null object 4 Dependent_count 10127 non-null int64 5 Education_Level 8608 non-null object 6 Marital_Status 9378 non-null object 7 Income_Category 10127 non-null object 8 Card_Category 10127 non-null object 9 Months_on_book 10127 non-null int64 10 Total_Relationship_Count 10127 non-null int64 11 Months_Inactive_12_mon 10127 non-null int64 12 Contacts_Count_12_mon 10127 non-null int64 13 Credit_Limit 10127 non-null float64 14 Total_Revolving_Bal 10127 non-null int64 15 Avg_Open_To_Buy 10127 non-null float64 16 Total_Amt_Chng_Q4_Q1 10127 non-null float64 17 Total_Trans_Amt 10127 non-null int64 18 Total_Trans_Ct 10127 non-null int64 19 Total_Ct_Chng_Q4_Q1 10127 non-null float64 20 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(10), object(6) memory usage: 1.6+ MB
#checking for duplicates
df.isna().sum()
| 0 | |
|---|---|
| CLIENTNUM | 0 |
| Attrition_Flag | 0 |
| Customer_Age | 0 |
| Gender | 0 |
| Dependent_count | 0 |
| Education_Level | 1519 |
| Marital_Status | 749 |
| Income_Category | 0 |
| Card_Category | 0 |
| Months_on_book | 0 |
| Total_Relationship_Count | 0 |
| Months_Inactive_12_mon | 0 |
| Contacts_Count_12_mon | 0 |
| Credit_Limit | 0 |
| Total_Revolving_Bal | 0 |
| Avg_Open_To_Buy | 0 |
| Total_Amt_Chng_Q4_Q1 | 0 |
| Total_Trans_Amt | 0 |
| Total_Trans_Ct | 0 |
| Total_Ct_Chng_Q4_Q1 | 0 |
| Avg_Utilization_Ratio | 0 |
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CLIENTNUM | 10127.0 | 7.391776e+08 | 3.690378e+07 | 708082083.0 | 7.130368e+08 | 7.179264e+08 | 7.731435e+08 | 8.283431e+08 |
| Customer_Age | 10127.0 | 4.632596e+01 | 8.016814e+00 | 26.0 | 4.100000e+01 | 4.600000e+01 | 5.200000e+01 | 7.300000e+01 |
| Dependent_count | 10127.0 | 2.346203e+00 | 1.298908e+00 | 0.0 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 |
| Months_on_book | 10127.0 | 3.592841e+01 | 7.986416e+00 | 13.0 | 3.100000e+01 | 3.600000e+01 | 4.000000e+01 | 5.600000e+01 |
| Total_Relationship_Count | 10127.0 | 3.812580e+00 | 1.554408e+00 | 1.0 | 3.000000e+00 | 4.000000e+00 | 5.000000e+00 | 6.000000e+00 |
| Months_Inactive_12_mon | 10127.0 | 2.341167e+00 | 1.010622e+00 | 0.0 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 6.000000e+00 |
| Contacts_Count_12_mon | 10127.0 | 2.455317e+00 | 1.106225e+00 | 0.0 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 6.000000e+00 |
| Credit_Limit | 10127.0 | 8.631954e+03 | 9.088777e+03 | 1438.3 | 2.555000e+03 | 4.549000e+03 | 1.106750e+04 | 3.451600e+04 |
| Total_Revolving_Bal | 10127.0 | 1.162814e+03 | 8.149873e+02 | 0.0 | 3.590000e+02 | 1.276000e+03 | 1.784000e+03 | 2.517000e+03 |
| Avg_Open_To_Buy | 10127.0 | 7.469140e+03 | 9.090685e+03 | 3.0 | 1.324500e+03 | 3.474000e+03 | 9.859000e+03 | 3.451600e+04 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | 7.599407e-01 | 2.192068e-01 | 0.0 | 6.310000e-01 | 7.360000e-01 | 8.590000e-01 | 3.397000e+00 |
| Total_Trans_Amt | 10127.0 | 4.404086e+03 | 3.397129e+03 | 510.0 | 2.155500e+03 | 3.899000e+03 | 4.741000e+03 | 1.848400e+04 |
| Total_Trans_Ct | 10127.0 | 6.485869e+01 | 2.347257e+01 | 10.0 | 4.500000e+01 | 6.700000e+01 | 8.100000e+01 | 1.390000e+02 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | 7.122224e-01 | 2.380861e-01 | 0.0 | 5.820000e-01 | 7.020000e-01 | 8.180000e-01 | 3.714000e+00 |
| Avg_Utilization_Ratio | 10127.0 | 2.748936e-01 | 2.756915e-01 | 0.0 | 2.300000e-02 | 1.760000e-01 | 5.030000e-01 | 9.990000e-01 |
df.describe(include = 'object').T
| count | unique | top | freq | |
|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 |
| Gender | 10127 | 2 | F | 5358 |
| Education_Level | 8608 | 6 | Graduate | 3128 |
| Marital_Status | 9378 | 3 | Married | 4687 |
| Income_Category | 10127 | 6 | Less than $40K | 3561 |
| Card_Category | 10127 | 4 | Blue | 9436 |
df.select_dtypes(include = 'object').nunique()
| 0 | |
|---|---|
| Attrition_Flag | 2 |
| Gender | 2 |
| Education_Level | 6 |
| Marital_Status | 3 |
| Income_Category | 6 |
| Card_Category | 4 |
#Dropping the client ID column
df.drop(columns = 'CLIENTNUM', inplace = True)
object_col = df.select_dtypes(include='object').columns.tolist()
df[object_col] = df[object_col].astype('category')
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Attrition_Flag 10127 non-null category 1 Customer_Age 10127 non-null int64 2 Gender 10127 non-null category 3 Dependent_count 10127 non-null int64 4 Education_Level 8608 non-null category 5 Marital_Status 9378 non-null category 6 Income_Category 10127 non-null category 7 Card_Category 10127 non-null category 8 Months_on_book 10127 non-null int64 9 Total_Relationship_Count 10127 non-null int64 10 Months_Inactive_12_mon 10127 non-null int64 11 Contacts_Count_12_mon 10127 non-null int64 12 Credit_Limit 10127 non-null float64 13 Total_Revolving_Bal 10127 non-null int64 14 Avg_Open_To_Buy 10127 non-null float64 15 Total_Amt_Chng_Q4_Q1 10127 non-null float64 16 Total_Trans_Amt 10127 non-null int64 17 Total_Trans_Ct 10127 non-null int64 18 Total_Ct_Chng_Q4_Q1 10127 non-null float64 19 Avg_Utilization_Ratio 10127 non-null float64 dtypes: category(6), float64(5), int64(9) memory usage: 1.1 MB
def boxhist_plot(data, column, figsize = (16,6)):
"""
data: dataframe
column: column name
figsize: size of figure
"""
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=figsize) # Create a figure and a set of subplots
sns.boxplot(data[column], ax=ax1) # Plot the boxplot on the first subplot (ax1)
sns.histplot(data[column], ax=ax2) # Plot the histogram on the second subplot (ax2)
sns.kdeplot(data[column], ax=ax3) # Plot the kernel density estimate on the third subplot (ax3)
sns.violinplot(data[column], ax=ax4) # Plot the violin plot on the fourth subplot (ax4)
ax2.axvline(
data[column].median(),
color='green',
linestyle='dashed',
linewidth=2,
label='Mean'
)
ax2.axvline(
data[column].mean(),
color='red',
linestyle='dashed',
linewidth=2,
label='Mean'
)
ax1.set_title(f'Boxplot of {column}') # Set the title of the first subplot
ax2.set_title(f'Histogram of {column}') # Set the title of the second subplot
plt.show() # Display the plot
# function to create labeled barplots
def labeled_barplot(data, feature, feature_2, order, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
feature_2: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90)
ax = sns.countplot(
data=data,
x=feature,
palette='coolwarm',
order=order,
hue=feature_2,
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=9,
xytext=(0, 5),
textcoords="offset points"
) # annotate the percentage
plt.show() # show the plot
boxhist_plot(df, 'Customer_Age')
boxhist_plot(df, 'Dependent_count')
boxhist_plot(df, 'Months_on_book')
boxhist_plot(df, 'Total_Relationship_Count')
boxhist_plot(df, 'Months_Inactive_12_mon')
boxhist_plot(df, 'Contacts_Count_12_mon')
boxhist_plot(df, 'Credit_Limit')
boxhist_plot(df, 'Total_Revolving_Bal')
boxhist_plot(df, 'Avg_Open_To_Buy')
boxhist_plot(df, 'Total_Trans_Amt')
boxhist_plot(df, 'Total_Trans_Ct')
boxhist_plot(df, 'Total_Ct_Chng_Q4_Q1')
boxhist_plot(df, 'Total_Amt_Chng_Q4_Q1')
boxhist_plot(df, 'Avg_Utilization_Ratio')
labeled_barplot(df, 'Gender', 'Attrition_Flag', order=df.Gender.value_counts().index)
labeled_barplot(df, 'Education_Level', 'Attrition_Flag', order=df.Education_Level.value_counts().index)
labeled_barplot(df, 'Marital_Status', 'Attrition_Flag', order=df.Marital_Status.value_counts().index)
labeled_barplot(df, 'Income_Category', 'Attrition_Flag', order=df.Income_Category.value_counts().index)
labeled_barplot(df, 'Card_Category', 'Attrition_Flag', order=df.Card_Category.value_counts().index)
num_col = df.select_dtypes(include='number').columns.tolist()
num_col
['Customer_Age', 'Dependent_count', 'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']
sns.pairplot(df[num_col], diag_kind='kde', corner=True)
<seaborn.axisgrid.PairGrid at 0x7d3e6b974a60>
def boxplot_with_target(data: pd.DataFrame, numeric_columns, target, include_outliers):
subplot_cols = 2
subplot_rows = int(len(numeric_columns) / 2 + 1)
plt.figure(figsize=(16, 3 * subplot_rows))
for i, col in enumerate(numeric_columns):
plt.subplot(8, 2, i + 1)
sns.boxplot(
data=data,
x=target,
y=col,
orient="vertical",
palette="Blues",
showfliers=include_outliers,
)
plt.xlabel(target, fontsize=12)
plt.ylabel(col, fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.show()
boxplot_with_target(df, num_col, 'Attrition_Flag', include_outliers=True)
Attrited Customers tend to have:
Lower Total_Trans_Amt and Total_Trans_Ct values. This means they make fewer and smaller transactions compared to Existing Customers.
boxplot_with_target(df, num_col, 'Attrition_Flag', include_outliers=False)
plt.figure(figsize=(15, 7))
sns.heatmap(df.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Customer Demographics and Behavior:
Transaction Patterns:
#Outlier Detection
plt.figure(figsize=(10,30))
for i, variable in enumerate(num_col):
plt.subplot(8,2,i+1)
sns.boxplot(df[num_col[i]])
plt.tight_layout()
plt.title(variable)
plt.show()
X = df.drop(columns='Attrition_Flag', axis=1)
y = df['Attrition_Flag']
X = pd.get_dummies(X, drop_first=True)
X.shape
(10127, 30)
X.replace(True, 1, inplace = True)
X.replace(False, 0, inplace = True)
X.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 30 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Customer_Age 10127 non-null int64 1 Dependent_count 10127 non-null int64 2 Months_on_book 10127 non-null int64 3 Total_Relationship_Count 10127 non-null int64 4 Months_Inactive_12_mon 10127 non-null int64 5 Contacts_Count_12_mon 10127 non-null int64 6 Credit_Limit 10127 non-null float64 7 Total_Revolving_Bal 10127 non-null int64 8 Avg_Open_To_Buy 10127 non-null float64 9 Total_Amt_Chng_Q4_Q1 10127 non-null float64 10 Total_Trans_Amt 10127 non-null int64 11 Total_Trans_Ct 10127 non-null int64 12 Total_Ct_Chng_Q4_Q1 10127 non-null float64 13 Avg_Utilization_Ratio 10127 non-null float64 14 Gender_M 10127 non-null int64 15 Education_Level_Doctorate 10127 non-null int64 16 Education_Level_Graduate 10127 non-null int64 17 Education_Level_High School 10127 non-null int64 18 Education_Level_Post-Graduate 10127 non-null int64 19 Education_Level_Uneducated 10127 non-null int64 20 Marital_Status_Married 10127 non-null int64 21 Marital_Status_Single 10127 non-null int64 22 Income_Category_$40K - $60K 10127 non-null int64 23 Income_Category_$60K - $80K 10127 non-null int64 24 Income_Category_$80K - $120K 10127 non-null int64 25 Income_Category_Less than $40K 10127 non-null int64 26 Income_Category_abc 10127 non-null int64 27 Card_Category_Gold 10127 non-null int64 28 Card_Category_Platinum 10127 non-null int64 29 Card_Category_Silver 10127 non-null int64 dtypes: float64(5), int64(25) memory usage: 2.3 MB
from sklearn.model_selection import train_test_split
# Splitting data into training, validation and test set:
# first we split data into 2 parts, temporary and test
X_temp, X_test, y_temp, y_test = train_test_split( X, y, test_size=0.30, random_state=40, stratify=y )
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split( X_temp, y_temp, test_size=0.30, random_state=40, stratify=y_temp )
#confirm the shape of both data sets and the ratio of classes is the same across train, validation and test datasets
print("Shape of Training set : ", X_train.shape)
print("Shape of validation set : ", X_val.shape)
print("Shape of test set : ", X_test.shape)
print(' ')
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print(' ')
print("Percentage of classes in validation set:")
print(y_val.value_counts(normalize=True))
print(' ')
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set : (4961, 30) Shape of validation set : (2127, 30) Shape of test set : (3039, 30) Percentage of classes in training set: Attrition_Flag Existing Customer 0.839347 Attrited Customer 0.160653 Name: proportion, dtype: float64 Percentage of classes in validation set: Attrition_Flag Existing Customer 0.83921 Attrited Customer 0.16079 Name: proportion, dtype: float64 Percentage of classes in test set: Attrition_Flag Existing Customer 0.839421 Attrited Customer 0.160579 Name: proportion, dtype: float64
#libraries for metrics and statistics
from sklearn import metrics
import scipy.stats as stats
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score, roc_auc_score
def model_score(model,X_test, y_test):
'''
model_score : extract performed model score in single code snippet
'''
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = metrics.confusion_matrix(y_test, y_pred)
true_positive = cm[0][0]
false_positive = cm[0][1]
false_negative = cm[1][0]
true_negative = cm[1][1]
Precision = true_positive/(true_positive+false_positive)
Recall = true_positive/(true_positive+false_negative)
F1_Score = 2*(Recall * Precision) / (Recall + Precision)
print("Confusion Matrix: ", cm)
print("Accuracy Score: ", acc)
print("Precision Score: ", Precision)
print("Recall Score: ", Recall)
print("F1 Score: ", F1_Score)
1. Decision Tree Model
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier(random_state=40)
dtree.fit(X_train, y_train)
DecisionTreeClassifier(random_state=40)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
DecisionTreeClassifier(random_state=40)
print(dtree.score(X_train, y_train))
print(dtree.score(X_val, y_val))
1.0 0.9459332393041843
dtree_modelscore = model_score(dtree, X_test, y_test)
Confusion Matrix: [[ 392 96] [ 104 2447]] Accuracy Score: 0.9341888779203685 Precision Score: 0.8032786885245902 Recall Score: 0.7903225806451613 F1 Score: 0.7967479674796747
2. Bagging
from sklearn.ensemble import BaggingClassifier
bagging = BaggingClassifier(random_state=40)
bagging.fit(X_train, y_train)
BaggingClassifier(random_state=40)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
BaggingClassifier(random_state=40)
print(bagging.score(X_train, y_train))
print(bagging.score(X_val, y_val))
0.9977827050997783 0.9619181946403385
bagging_modelscore = model_score(bagging, X_test, y_test)
Confusion Matrix: [[ 415 73] [ 69 2482]] Accuracy Score: 0.9532741033234616 Precision Score: 0.8504098360655737 Recall Score: 0.8574380165289256 F1 Score: 0.8539094650205761
3. Boosting
3.1 Ada Boost
from sklearn.ensemble import AdaBoostClassifier
adaboost = AdaBoostClassifier(random_state=40)
adaboost.fit(X_train, y_train)
AdaBoostClassifier(random_state=40)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
AdaBoostClassifier(random_state=40)
print(adaboost.score(X_train, y_train))
print(adaboost.score(X_val, y_val))
0.9621044144325741 0.9576868829337094
adaboost_modelscore = model_score(adaboost, X_test, y_test)
Confusion Matrix: [[ 404 84] [ 53 2498]] Accuracy Score: 0.9549193813754524 Precision Score: 0.8278688524590164 Recall Score: 0.8840262582056893 F1 Score: 0.8550264550264551
3.2 Gradient Boost
from sklearn.ensemble import GradientBoostingClassifier
gradientboost = GradientBoostingClassifier(random_state=40)
gradientboost.fit(X_train, y_train)
GradientBoostingClassifier(random_state=40)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
GradientBoostingClassifier(random_state=40)
print(gradientboost.score(X_train, y_train))
print(gradientboost.score(X_val, y_val))
0.975609756097561 0.9680300893276916
gradientboost_modelscore = model_score(gradientboost, X_test, y_test)
Confusion Matrix: [[ 395 93] [ 30 2521]] Accuracy Score: 0.9595261599210266 Precision Score: 0.8094262295081968 Recall Score: 0.9294117647058824 F1 Score: 0.8652792990142388
4. RandomForest
from sklearn.ensemble import RandomForestClassifier
randomforest = RandomForestClassifier(random_state=40)
randomforest.fit(X_train, y_train)
RandomForestClassifier(random_state=40)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
RandomForestClassifier(random_state=40)
print(randomforest.score(X_train, y_train))
print(randomforest.score(X_val, y_val))
1.0 0.9543958627174424
randomforest_modelscore = model_score(randomforest, X_test, y_test)
Confusion Matrix: [[ 373 115] [ 34 2517]] Accuracy Score: 0.9509707140506746 Precision Score: 0.764344262295082 Recall Score: 0.9164619164619164 F1 Score: 0.8335195530726257
df_over = df.copy()
X = df_over.drop(columns=['Attrition_Flag', 'Education_Level', 'Marital_Status'], axis=1)
y = df_over['Attrition_Flag']
X = pd.get_dummies(X, drop_first=True)
X.replace(True, 1, inplace = True)
X.replace(False, 0, inplace = True)
y.replace('Attrited Customer', 1, inplace = True)
y.replace('Existing Customer', 0, inplace = True)
X.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Customer_Age 10127 non-null int64 1 Dependent_count 10127 non-null int64 2 Months_on_book 10127 non-null int64 3 Total_Relationship_Count 10127 non-null int64 4 Months_Inactive_12_mon 10127 non-null int64 5 Contacts_Count_12_mon 10127 non-null int64 6 Credit_Limit 10127 non-null float64 7 Total_Revolving_Bal 10127 non-null int64 8 Avg_Open_To_Buy 10127 non-null float64 9 Total_Amt_Chng_Q4_Q1 10127 non-null float64 10 Total_Trans_Amt 10127 non-null int64 11 Total_Trans_Ct 10127 non-null int64 12 Total_Ct_Chng_Q4_Q1 10127 non-null float64 13 Avg_Utilization_Ratio 10127 non-null float64 14 Gender_M 10127 non-null int64 15 Income_Category_$40K - $60K 10127 non-null int64 16 Income_Category_$60K - $80K 10127 non-null int64 17 Income_Category_$80K - $120K 10127 non-null int64 18 Income_Category_Less than $40K 10127 non-null int64 19 Income_Category_abc 10127 non-null int64 20 Card_Category_Gold 10127 non-null int64 21 Card_Category_Platinum 10127 non-null int64 22 Card_Category_Silver 10127 non-null int64 dtypes: float64(5), int64(18) memory usage: 1.8 MB
# Splitting data into training, validation and test set:
# first we split data into 2 parts, temporary and test
X_temp, X_testover, y_temp, y_testover = train_test_split( X, y, test_size=0.30, random_state=40, stratify=y )
# then we split the temporary set into train and validation
X_train, X_valover, y_train, y_valover = train_test_split( X_temp, y_temp, test_size=0.30, random_state=40, stratify=y_temp )
# To oversample data
from imblearn.over_sampling import SMOTE
print("Before UpSampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before UpSampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
sm = SMOTE( sampling_strategy="minority", k_neighbors=10, random_state=40 )
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After UpSampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After UpSampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))
print("After UpSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After UpSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before UpSampling, counts of label 'Yes': 797 Before UpSampling, counts of label 'No': 4164 After UpSampling, counts of label 'Yes': 4164 After UpSampling, counts of label 'No': 4164 After UpSampling, the shape of train_X: (8328, 23) After UpSampling, the shape of train_y: (8328,)
1. Decision Tree Model
dtree = DecisionTreeClassifier(random_state=40)
dtree.fit(X_train_over, y_train_over)
DecisionTreeClassifier(random_state=40)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
DecisionTreeClassifier(random_state=40)
print(dtree.score(X_train_over, y_train_over))
print(dtree.score(X_valover, y_valover))
1.0 0.9280677009873061
dtreeover_modelscore = model_score(dtree, X_testover, y_testover)
Confusion Matrix: [[2393 158] [ 75 413]] Accuracy Score: 0.9233300427772293 Precision Score: 0.9380635045080361 Recall Score: 0.9696110210696921 F1 Score: 0.9535764096433553
2. Bagging
bagging = BaggingClassifier(random_state=40)
bagging.fit(X_train_over, y_train_over)
BaggingClassifier(random_state=40)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
BaggingClassifier(random_state=40)
print(bagging.score(X_train_over, y_train_over))
print(bagging.score(X_valover, y_valover))
0.9983189241114313 0.9431123648330982
baggingover_modelscore = model_score(bagging, X_testover, y_testover)
Confusion Matrix: [[2449 102] [ 73 415]] Accuracy Score: 0.9424152681803225 Precision Score: 0.960015680125441 Recall Score: 0.9710547184773989 F1 Score: 0.9655036467573428
3. Boosting
3.1 Ada Boost
adaboost = AdaBoostClassifier(random_state=40)
adaboost.fit(X_train_over, y_train_over)
AdaBoostClassifier(random_state=40)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
AdaBoostClassifier(random_state=40)
print(adaboost.score(X_train_over, y_train_over))
print(adaboost.score(X_valover, y_valover))
0.9612151777137368 0.9402914903620122
adaboostover_modelscore = model_score(adaboost, X_testover, y_testover)
Confusion Matrix: [[2407 144] [ 64 424]] Accuracy Score: 0.9315564330371833 Precision Score: 0.9435515484123873 Recall Score: 0.9740995548360988 F1 Score: 0.9585822381521307
3.2 Gradient Boost
gradientboost = GradientBoostingClassifier(random_state=40)
gradientboost.fit(X_train_over, y_train_over)
GradientBoostingClassifier(random_state=40)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
GradientBoostingClassifier(random_state=40)
print(gradientboost.score(X_train_over, y_train_over))
print(gradientboost.score(X_valover, y_valover))
0.9780259365994236 0.9543958627174424
gradientboostover_modelscore = model_score(gradientboost, X_testover, y_testover)
Confusion Matrix: [[2439 112] [ 56 432]] Accuracy Score: 0.9447186574531096 Precision Score: 0.9560956487651902 Recall Score: 0.9775551102204408 F1 Score: 0.9667063020214032
4. RandomForest
randomforest = RandomForestClassifier(random_state=40)
randomforest.fit(X_train_over, y_train_over)
RandomForestClassifier(random_state=40)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
RandomForestClassifier(random_state=40)
print(randomforest.score(X_train_over, y_train_over))
print(randomforest.score(X_valover, y_valover))
1.0 0.9567465914433474
randomforestover_modelscore = model_score(randomforest, X_testover, y_testover)
Confusion Matrix: [[2473 78] [ 83 405]] Accuracy Score: 0.9470220467258966 Precision Score: 0.9694237553900431 Recall Score: 0.9675273865414711 F1 Score: 0.9684746426473467
# To undersample data
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=40)
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_under == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_under == 0)))
print("After Under Sampling, the shape of train_X: {}".format(X_train_under.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_under.shape))
Before Under Sampling, counts of label 'Yes': 797 Before Under Sampling, counts of label 'No': 4164 After Under Sampling, counts of label 'Yes': 797 After Under Sampling, counts of label 'No': 797 After Under Sampling, the shape of train_X: (1594, 23) After Under Sampling, the shape of train_y: (1594,)
1. Decision Tree Model
dtree = DecisionTreeClassifier(random_state=40)
dtree.fit(X_train_under, y_train_under)
DecisionTreeClassifier(random_state=40)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
DecisionTreeClassifier(random_state=40)
print(dtree.score(X_train_under, y_train_under))
print(dtree.score(X_valover, y_valover))
1.0 0.8951574988246357
dtreeunder_modelscore = model_score(dtree, X_testover, y_testover)
Confusion Matrix: [[2270 281] [ 61 427]] Accuracy Score: 0.8874629812438302 Precision Score: 0.8898471187769502 Recall Score: 0.9738309738309738 F1 Score: 0.9299467431380581
2. Bagging
bagging = BaggingClassifier(random_state=40)
bagging.fit(X_train_under, y_train_under)
BaggingClassifier(random_state=40)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
BaggingClassifier(random_state=40)
print(bagging.score(X_train_under, y_train_under))
print(bagging.score(X_valover, y_valover))
0.993099121706399 0.9355900329102022
baggingunder_modelscore = model_score(bagging, X_testover, y_testover)
Confusion Matrix: [[2366 185] [ 51 437]] Accuracy Score: 0.9223428759460349 Precision Score: 0.9274794198353586 Recall Score: 0.9788994621431527 F1 Score: 0.9524959742351048
3. Boosting
3.1 Ada Boost
adaboost = AdaBoostClassifier(random_state=40)
adaboost.fit(X_train_under, y_train_under)
AdaBoostClassifier(random_state=40)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
AdaBoostClassifier(random_state=40)
print(adaboost.score(X_train_under, y_train_under))
print(adaboost.score(X_valover, y_valover))
0.9554579673776662 0.9337094499294781
adaboostunder_modelscore = model_score(adaboost, X_testover, y_testover)
Confusion Matrix: [[2351 200] [ 36 452]] Accuracy Score: 0.9223428759460349 Precision Score: 0.9215993727949824 Recall Score: 0.9849183074989527 F1 Score: 0.9522073714054273
3.2 Gradient Boost
gradientboost = GradientBoostingClassifier(random_state=40)
gradientboost.fit(X_train_under, y_train_under)
GradientBoostingClassifier(random_state=40)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
GradientBoostingClassifier(random_state=40)
print(gradientboost.score(X_train_under, y_train_under))
print(gradientboost.score(X_valover, y_valover))
0.9849435382685069 0.9431123648330982
gradientboostunder_modelscore = model_score(gradientboost, X_testover, y_testover)
Confusion Matrix: [[2392 159] [ 32 456]] Accuracy Score: 0.937150378413952 Precision Score: 0.9376715013720109 Recall Score: 0.9867986798679867 F1 Score: 0.9616080402010051
4. RandomForest
randomforest = RandomForestClassifier(random_state=40)
randomforest.fit(X_train_under, y_train_under)
RandomForestClassifier(random_state=40)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
RandomForestClassifier(random_state=40)
print(randomforest.score(X_train_under, y_train_under))
print(randomforest.score(X_valover, y_valover))
1.0 0.9355900329102022
randomforestover_modelscore = model_score(randomforest, X_testover, y_testover)
Confusion Matrix: [[2366 185] [ 42 446]] Accuracy Score: 0.9253043764396183 Precision Score: 0.9274794198353586 Recall Score: 0.9825581395348837 F1 Score: 0.9542246420649324
# Create a list of model names and scores
model_names = ['Decision Tree', 'Bagging', 'AdaBoost', 'Gradient Boost', 'Random Forest',
'Decision Tree (Oversampled)', 'Bagging (Oversampled)', 'AdaBoost (Oversampled)',
'Gradient Boost (Oversampled)', 'Random Forest (Oversampled)',
'Decision Tree (Undersampled)', 'Bagging (Undersampled)', 'AdaBoost (Undersampled)',
'Gradient Boost (Undersampled)', 'Random Forest (Undersampled)']
# Create a list of dictionaries, where each dictionary represents a model's scores
model_scores = [
{'Accuracy': 0.934, 'Precision': 0.962, 'Recall': 0.958, 'F1-Score': 0.960}, # Decision Tree
{'Accuracy': 0.962, 'Precision': 0.983, 'Recall': 0.973, 'F1-Score': 0.978}, # Bagging
{'Accuracy': 0.957, 'Precision': 0.972, 'Recall': 0.970, 'F1-Score': 0.971}, # AdaBoost
{'Accuracy': 0.961, 'Precision': 0.979, 'Recall': 0.974, 'F1-Score': 0.976}, # Gradient Boost
{'Accuracy': 0.964, 'Precision': 0.987, 'Recall': 0.972, 'F1-Score': 0.980}, # Random Forest
{'Accuracy': 0.911, 'Precision': 0.917, 'Recall': 0.975, 'F1-Score': 0.945}, # Decision Tree (Oversampled)
{'Accuracy': 0.956, 'Precision': 0.965, 'Recall': 0.981, 'F1-Score': 0.973}, # Bagging (Oversampled)
{'Accuracy': 0.934, 'Precision': 0.934, 'Recall': 0.987, 'F1-Score': 0.960}, # AdaBoost (Oversampled)
{'Accuracy': 0.950, 'Precision': 0.954, 'Recall': 0.985, 'F1-Score': 0.969}, # Gradient Boost (Oversampled)
{'Accuracy': 0.960, 'Precision': 0.967, 'Recall': 0.984, 'F1-Score': 0.975}, # Random Forest (Oversampled)
{'Accuracy': 0.825, 'Precision': 0.882, 'Recall': 0.887, 'F1-Score': 0.885}, # Decision Tree (Undersampled)
{'Accuracy': 0.864, 'Precision': 0.903, 'Recall': 0.918, 'F1-Score': 0.910}, # Bagging (Undersampled)
{'Accuracy': 0.892, 'Precision': 0.915, 'Recall': 0.943, 'F1-Score': 0.929}, # AdaBoost (Undersampled)
{'Accuracy': 0.896, 'Precision': 0.916, 'Recall': 0.945, 'F1-Score': 0.930}, # Gradient Boost (Undersampled)
{'Accuracy': 0.898, 'Precision': 0.919, 'Recall': 0.944, 'F1-Score': 0.931} # Random Forest
]
# Create a DataFrame from the list of dictionaries
df_model_scores = pd.DataFrame(model_scores, index=model_names)
# Display the DataFrame
df_model_scores
# Sorting models in decreasing order of test recall
df_model_scores.sort_values(
by=["Accuracy", "Recall"], ascending=False
).style.highlight_max(color="lightgreen", axis=0).highlight_min(color="pink", axis=0)
| Accuracy | Precision | Recall | F1-Score | |
|---|---|---|---|---|
| Random Forest | 0.964000 | 0.987000 | 0.972000 | 0.980000 |
| Bagging | 0.962000 | 0.983000 | 0.973000 | 0.978000 |
| Gradient Boost | 0.961000 | 0.979000 | 0.974000 | 0.976000 |
| Random Forest (Oversampled) | 0.960000 | 0.967000 | 0.984000 | 0.975000 |
| AdaBoost | 0.957000 | 0.972000 | 0.970000 | 0.971000 |
| Bagging (Oversampled) | 0.956000 | 0.965000 | 0.981000 | 0.973000 |
| Gradient Boost (Oversampled) | 0.950000 | 0.954000 | 0.985000 | 0.969000 |
| AdaBoost (Oversampled) | 0.934000 | 0.934000 | 0.987000 | 0.960000 |
| Decision Tree | 0.934000 | 0.962000 | 0.958000 | 0.960000 |
| Decision Tree (Oversampled) | 0.911000 | 0.917000 | 0.975000 | 0.945000 |
| Random Forest (Undersampled) | 0.898000 | 0.919000 | 0.944000 | 0.931000 |
| Gradient Boost (Undersampled) | 0.896000 | 0.916000 | 0.945000 | 0.930000 |
| AdaBoost (Undersampled) | 0.892000 | 0.915000 | 0.943000 | 0.929000 |
| Bagging (Undersampled) | 0.864000 | 0.903000 | 0.918000 | 0.910000 |
| Decision Tree (Undersampled) | 0.825000 | 0.882000 | 0.887000 | 0.885000 |
Original Dataset
Oversampled Dataset
Undersampled Dataset
We will tune the top 3 models using Random Search CV
1. Random Forest
from sklearn.model_selection import RandomizedSearchCV
# we are tuning some hyperparameters right now, we are passing the different values for both parameters
grid_param = {
"n_estimators": np.arange(10, 40, 10),
"min_samples_leaf": np.arange(5, 10),
"min_samples_split": [3, 5, 7],
"max_features": ["sqrt", "log2"],
"max_samples": np.arange(0.3, 0.7, 0.1),
}
randomforest_tunned = RandomizedSearchCV(estimator = randomforest,
param_distributions = grid_param,
cv = 5,
n_jobs = -1)
randomforest_tunned.fit(X_train, y_train)
RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(random_state=40),
n_jobs=-1,
param_distributions={'max_features': ['sqrt', 'log2'],
'max_samples': array([0.3, 0.4, 0.5, 0.6]),
'min_samples_leaf': array([5, 6, 7, 8, 9]),
'min_samples_split': [3, 5, 7],
'n_estimators': array([10, 20, 30])})In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(random_state=40),
n_jobs=-1,
param_distributions={'max_features': ['sqrt', 'log2'],
'max_samples': array([0.3, 0.4, 0.5, 0.6]),
'min_samples_leaf': array([5, 6, 7, 8, 9]),
'min_samples_split': [3, 5, 7],
'n_estimators': array([10, 20, 30])})RandomForestClassifier(max_features='log2', max_samples=0.6000000000000001,
min_samples_leaf=7, min_samples_split=5, n_estimators=30,
random_state=40)RandomForestClassifier(max_features='log2', max_samples=0.6000000000000001,
min_samples_leaf=7, min_samples_split=5, n_estimators=30,
random_state=40)best_parameter = randomforest_tunned.best_params_
print(best_parameter)
{'n_estimators': 30, 'min_samples_split': 5, 'min_samples_leaf': 7, 'max_samples': 0.6000000000000001, 'max_features': 'log2'}
randomforest_tunned.best_score_
0.9367073547087678
rfcl_tuned = RandomForestClassifier(n_estimators=30, min_samples_split=5, min_samples_leaf=7, max_samples=0.6000000000000001, max_features='log2')
rfcl_tuned.fit(X_train, y_train)
RandomForestClassifier(max_features='log2', max_samples=0.6000000000000001,
min_samples_leaf=7, min_samples_split=5,
n_estimators=30)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. RandomForestClassifier(max_features='log2', max_samples=0.6000000000000001,
min_samples_leaf=7, min_samples_split=5,
n_estimators=30)2. Bagging
# we are tuning some hyperparameters right now, we are passing the different values for both parameters
grid_param = {
"n_estimators": np.arange(10, 40, 10),
"max_features": np.arange(0, 15),
"max_samples": np.arange(0.3, 0.7, 0.1),
"bootstrap": [True, False],
}
bagging_tunned = RandomizedSearchCV(estimator = bagging,
param_distributions = grid_param,
cv = 5,
n_jobs = -1)
bagging_tunned.fit(X_train, y_train)
RandomizedSearchCV(cv=5, estimator=BaggingClassifier(random_state=40),
n_jobs=-1,
param_distributions={'bootstrap': [True, False],
'max_features': array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]),
'max_samples': array([0.3, 0.4, 0.5, 0.6]),
'n_estimators': array([10, 20, 30])})In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. RandomizedSearchCV(cv=5, estimator=BaggingClassifier(random_state=40),
n_jobs=-1,
param_distributions={'bootstrap': [True, False],
'max_features': array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]),
'max_samples': array([0.3, 0.4, 0.5, 0.6]),
'n_estimators': array([10, 20, 30])})BaggingClassifier(bootstrap=False, max_features=14, max_samples=0.5,
n_estimators=20, random_state=40)BaggingClassifier(bootstrap=False, max_features=14, max_samples=0.5,
n_estimators=20, random_state=40)best_parameter = bagging_tunned.best_params_
print(best_parameter)
{'n_estimators': 20, 'max_samples': 0.5, 'max_features': 14, 'bootstrap': False}
bagging_tunned.best_score_
0.9526315255173309
3. Gradient Boost
# we are tuning some hyperparameters right now, we are passing the different values for both parameters
GradientBoostingClassifier()
grid_param = {
'n_estimators': [100, 200, 300],
'learning_rate': [0.01, 0.1, 0.2],
'max_depth': [3, 5, 7],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
gb_tunned = RandomizedSearchCV(estimator = gradientboost,
param_distributions = grid_param,
cv = 5,
n_jobs = -1)
gb_tunned.fit(X_train, y_train)
RandomizedSearchCV(cv=5, estimator=GradientBoostingClassifier(random_state=40),
n_jobs=-1,
param_distributions={'learning_rate': [0.01, 0.1, 0.2],
'max_depth': [3, 5, 7],
'min_samples_leaf': [1, 2, 4],
'min_samples_split': [2, 5, 10],
'n_estimators': [100, 200, 300]})In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. RandomizedSearchCV(cv=5, estimator=GradientBoostingClassifier(random_state=40),
n_jobs=-1,
param_distributions={'learning_rate': [0.01, 0.1, 0.2],
'max_depth': [3, 5, 7],
'min_samples_leaf': [1, 2, 4],
'min_samples_split': [2, 5, 10],
'n_estimators': [100, 200, 300]})GradientBoostingClassifier(learning_rate=0.2, min_samples_leaf=2,
min_samples_split=10, n_estimators=200,
random_state=40)GradientBoostingClassifier(learning_rate=0.2, min_samples_leaf=2,
min_samples_split=10, n_estimators=200,
random_state=40)best_parameter = gb_tunned.best_params_
print(best_parameter)
{'n_estimators': 200, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_depth': 3, 'learning_rate': 0.2}
gb_tunned.best_score_
0.9723855293506155
# Create a list of model names and scores
model_names = ['Random Forest', 'Bagging', 'Gradient Boost']
# Create a list of dictionaries, where each dictionary represents a model's scores
model_scores = [
{'Random Forest' : randomforest_tunned.best_score_},
{'Bagging' : bagging_tunned.best_score_},
{'Gradient Boost' : gb_tunned.best_score_}
]
# Create a DataFrame from the list of dictionaries
df_best_model_scores = pd.DataFrame(model_scores, index=model_names)
# Display the DataFrame
df_best_model_scores
| Random Forest | Bagging | Gradient Boost | |
|---|---|---|---|
| Random Forest | 0.936707 | NaN | NaN |
| Bagging | NaN | 0.952632 | NaN |
| Gradient Boost | NaN | NaN | 0.972386 |
Gradient Boost is well performing model.Random Forest also will be do well but here we are not prefer it because of its computational cost.By addressing the identified insights and implementing the recommendations, Thera Bank can potentially reduce customer churn and enhance its overall customer retention strategy.